Sparse Non Gaussian Component Analysis by Semidefinite 

Programming 



Elmar Diederichs * 
Weierstrass Institute 
Mohrenstr. 39, 10117 Berlin, Germany 
diederic@wias-berlin . de 

Arkadi Nemirovski 
ISyE, Georgia Institute of Technology 
Atlanta, Georgia 30332, USA 
nemirovsQisye . gatech . edu 



Anatoli Juditsky 
LJK, Universite J. Fourier, BP 53 38041 
Grenoble cedex 9, France 
j udit skyOimag . f r 

Vladimir Spokoiny 
Weierstrass Institute and Humboldt University, 
Mohrenstr. 39, 10117 Berlin, Germany 
spokoinyOwias-berlin . de 



January 26, 2013 



Abstract 

Sparse non-Gaussian component analysis (SNGCA) is an unsupervised method of extracting a 
linear structure from a high dimensional data based on estimating a low-dimensional non-Gaussian 
data component. In this paper we discuss a new approach to direct estimation of the projector 
on the target space based on semidefinite programming which improves the method sensitivity to 
a broad variety of deviations from normality. 

We also discuss the procedures which allows to recover the structure when its effective dimen- 
sion is unknown. 
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1 Introduction 

Numerous statistical applications are confronted today with the so-called curse of dimensionality (cf. 
|13tl31j). Using high-dimensional datasets implies an exponential increase of computational effort for 
many statistical routines, while the data thin out in the local neighborhood of any given point and 
classical statistical methods become unreliable. When a random phenomenon is observed in the high 
dimensional space the " intrinsic dimension" m covering degrees of freedom associated with same 
features may be much smaller than d. Then introducing structural assumptions allows to reduce 
the problem complexity without sacrificing any statistical information |17l [25] . In this study we 
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consider the case where the phenomenon of interest is (approximately) located in a linear subspace 
X. When compared to other approaches which involve construction of nonlinear mappings from the 
original data space onto the "subspace of interest", such that isomaps |29j, local-linear embedding 
[T3j or Laplacian eigenmaps [H, a linear mapping appears attractive due to its simplicity — it may 
be identified with a simple object, the projector 11* from M*^ onto X. To find the structure of interest 
a statistician may seek for the non-Gaussian components of the data distribution, while its Gaussian 
components, as usual in the statistical literature may be treated as non-informative noise. 

Several techniques of estimating the "non-Gaussian subspace" have been proposed recently. In 
particular, NGCA (for Non-Gaussian Component Analysis) procedure, introduced in [H], and then 
developed into SNGCA (for Sparse NGCA) in [10], is based on the decomposition the problem of 
dimension reduction into two tasks: the first one is to extract from the data a set of candidate 
vectors (3j which are "close" to X. The second is to recover an estimation 11 of the projector 11* on 
X from In this paper we discuss a new method of SNGCA based on Semidefinite Relaxation of 

a nonconvex minmax problem which allows for a direct recovery of 11* . When compared to previous 
implementations of the SNGCA in [6l[7l[T^, the new approach "shortcuts" the intermediary stages 
and makes the best use of available information for estimation of X. Furthermore, it allows to treat 
in a transparent way the case of unknown dimension m of the target space X. 

The paper is organized as follows: in Section [2] we present the setup of SNGCA and briefly 
review some existing techniques. Then in Section [3] we introduce the new approach to recovery 
of the non-Gaussian subspace and analyze its accuracy. Further we provide a simulation study in 
Section [5j where we compare the performance of the proposed algorithm SNGCA to that of some 
known projective methods of feature extraction. 

2 Sparse Non- Gaussian Component Analysis 
2.1 The setup 

The Non-Gaussian Component Analysis (NGCA) approach is based on the assumption that a high 
dimensional distribution tends to be normal in almost any randomly selected direction. This intuitive 
fact can be justified by the central limit theorem when the number of directions tends to infinity. 
It leads to the NGCA-assumption: the data distribution is a superposition of a full dimensional 
Gaussian distribution and a low dimensional non-Gaussian component. In many practical problems 
like clustering or classification, the Gaussian component is uninformative and it is treated as noise. 
The approach suggests to identify the non-Gaussian component and to use it for the further analysis. 

The NGCA set-up can be formalized as follows; cf. [6]. Let Xi,...,Xjv be i.i.d. from a distri- 
bution P in describing the random phenomenon of interest. We suppose that P possesses a 
density p w.r.t. the Lebesgue measure on , which can be decomposed as follows: 

p{x) = <Pf,,j,{x)qiTx). (1) 

Here (p^^T. denotes the density of the multivariate normal distribution A^(/i, S) with parameters 
/i G M*^ (expectation) and S G W^^"^ positive definite (covariance matrix). The function q : — >• M 
with m < d is positive and bounded. T G M™-^'^ is an unknown linear mapping. We refer to 
X = range T as target or non-Gaussian subspace. Note that though T is not uniquely defined, X 
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is well defined, same as the Euclidean projector 11* on X.In what follows, unless it is explicitly 
specified otherwise, we assume that the effective dimension m of X is known a priori. For the sake 
of simplicity we assume that the expectation of X vanishes: E[X] = 0. 

The model ([T]) allows for the following interpretation (cf. Section 2 of [6]): suppose that the 
observation X S M'^ can be decomposed into X = Z + where Z is an "informative low-dimensional 
signal" such that Z G X, X being an m-dimensional subspace of M*^, and ^ is independent and 
Gaussian. One can easily show (see, e.g., Lemma 1 of [S]) that in this case the density of X can be 
represented as ([l]). 

2.2 Basics of SNGCA estimation procedure 

The estimation of X relies upon the following result, proved in [6]: suppose that the function q is 
smooth, then for any smooth function ^/^ : M'^ — )• M the assumptions of and E[X] = ensure 
that for 

m = E[VV(X)] = J VV(x) pix) dx, (2) 

there is a vector f3 £ I such that 

|/3W-/3|2<|S-iE[XVXX)]|2 

where Vip denotes the gradient of tp and | • |p is the standard £p-norm on M'^. In particular, if 
satisfies E[X?/'(X)] = 0, then G X. Consequently 

\{I-U*)/3{iP)\2< [ x^P{x)p{x)dx , (3) 



where / is the d -dimensional identity matrix and 11* is the Euclidean projector on X. 

The above result suggests the following two-stage estimation procedure: first compute a set of 
estimates {/3^} of elements {/3j} of X, then recover an estimation of X from {f3e}- This heuristics 
has been first used to estimate X in [6] . To be more precise, the construction implemented in can 
be summarized as follows: let for a family {hi}, i = 1, ...,L of smooth bounded (test) functions on 

7/= E[X/i,(X)], r//=i'E[V/j,(X)], (4) 

and let 

N N 



7/=i'iV-i J]x,/i,(^^), ve = N-'Y.^heiX,) (5) 

i=l i=l 



be their "empirical counterparts". The set of "approximating vectors" {Pi} used in [6j is as follows: 
Pe = 'Qe — S~^7£, £ = 1, where S is an estimate of the covariance matrix S. The projector 

estimation at the second stage is 11 = YlY=i ^J^J' ^^^^^ ^i' J — 1' are m principal eigenvectors 

of the matrix Yle=i PePj ■ ^ numerical study, provided in p], has shown that the above procedure 
can be used successfully to recover X. On the other hand, such implementation of the two-stage 
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procedure possesses two important drawbacks: it relies upon the estimation of the covariance matrix 
S of the Gaussian component, which can be hard even for moderate dimensions d. Poor estimation 
of S then will result in badly estimated vectors and as a result, poorly estimated I. Further, 
using the eigenvalue decomposition of the matrix X]^=i PiPj entails that the variance of the estima- 
tion n of the projector 11* on I is proportional to the number L of test-functions. As a result, the 
estimation procedure is restricted to utilizing only relatively small families {hi}, and is sensitive to 
the initial selection of " informative" test- functions. 

To circumvent the above limitations of the approach of [H] a different estimation procedure has been 
proposed in [iQj. In that procedure the estimates (3 of vectors from the target space are obtained by 
the method, which was referred to as convex projection. Let c G and let 

L L 

/?(c) = Yl ^^^^ ^(^) = Yl 
1=1 1=1 

Observe that /3(c) G I conditioned that 7(c) = . Indeed, if ■ip{x) = c^hi{x) , then c^'Ei[Xhi{X)] = 
0, and by ([s]), 

r^{c) = Yc'n^h,{X)](iX. 

i 

Therefore, the task of estimating /? G X reduces to that of finding a "good" corresponding coefficient 
vector. In jlO] vectors {cj} are computed as follows: let 

L L 

= ^ c^??£ and 7(c) = ^ 0^7^, £ = 1, L 
1=1 1=1 

and let G M"', j = 1, J constitute a set of probe unit vectors. Then it holds 

Cj = argmin^ll^j - r?(c)|2 | 7(c) =0, |c|i < 1} , (6) 

and we set (3j = (3{cj) = Yle^jVe- Then I is recovered by computing m principal axes of the minimal 
volume ellipsoid (Fritz-John ellipsoid) containing the estimated points {±/3j}j=i • 

The recovery of I through the Fritz-John ellipsoid (instead of eigenvalue decomposition of the 
matrix '^^PjPj) allows to bound the estimation error of I by the maximal error of estimation /3 
of elements of the target space (cf. Theorem 3 of [1^), while the £i-constraint on the coefficients 
Cj allows to control efficiently the maximal stochastic error of the estimations f3j (cf. Theorem 1 of 
|10l [28]). On the other hand, that construction heavily relies upon the choice of the probe vectors 
i^j. Indeed, in order to recover the projector on I, the collection of (3j should comprise at least m 
vectors with non-vanishing projection on the target space. To cope with this problem a multi-stage 
procedure has been used in [lOj: given a set {^j}k=o of probe vectors an estimation Ik=o is computed, 
which is used to draw new probe vectors {(,j}k=i from the vicinity of Xjt=o; these vectors are employed 
to compute a new estimation Ik=i, and so on. The iterative procedure improves significantly the 
accuracy of the recovery of I. Nevertheless, the choice of "informative" probe vectors at the first 
iteration k = remains a challenging task and hitherto is a weak point of the procedure. 



4 



3 Structural Analysis by Semidefinite Programming 



In the present paper we discuss a new choice of vectors /3 which solves the initialization problem of 
probe vectors for the SNGCA procedure in quite a particular way. Namely, the estimation procedure 
we are to present below does not require any probe vectors at all. 



3.1 Informative vectors in the target space 



Further developments are based on the following simple observation. Let rji and 7£ be defined as in 

and let U = [ryi, ?7l] € M"^^^, G = [71,..., 7^] G M'^^^. Using the observation in the previous 

L e 
i=i c li 

In other words, if 11* is the Euclidean projector on I, then 



section we conclude that if c G satisfies Gc = Yle=i ^^If- — then Uc = Yld=i '^^Vi belongs to I. 



{I-U*)Uc = 0. 

Suppose now that the set {h^} of test functions is rich enough in the sense that vectors Uc span I 
when c spans the subspace Gc = 0. Recall that we assume the dimension m of the target space to 
be known. Then projector 11* on I is fully identified as the optimal solution to the problem 



n* 



argminmax < |(/ — n)C/c|2 
n c ' 



n is a projector on an 
m-dimensional subspace of M'^ 



(7) 



c G 



% Gc = 



In practice vectors 7^ and r]i are not available, but we can suppose that their "empirical counterparts" 
- vectors 7^, ffi, i = 1, L can be computed, such that for a set A of probability at least 1 — e. 



\m-'ne\2<SN, \ie-ie\2<m, £ = l,...,L. 



(8) 



Indeed, it is well known (cf., e.g.. Lemma 1 in [T^ or [30]) that if functions hi^x) = f{x,u}£), 
£ = 1, L, are used, where / is continuously differentiable, uj£ G M*^ are vectors on the unit sphere 
and / and Vxf are bounded, then ^ holds with 



5n 



Ci max^gK'i, |a;|2=i \^xf{x,oj)\2N ^ minjd. In L} + In e ^, 
C2 max^gjRd^ \xf{x, w)|2iV"^/^v^min{c?,lnL} +lne-i. 



(9) 



where Ci, C2 are some absolute constants depending on the smoothness properties and the second 
moments of the underlying density. 

Then for any c G M'^ such that |c|i < 1 we can control the error of approximation of C£7£ and 

ceVe with their empirical versions. Namely, we have on A: 



max 

|c|i<l 



< 5]\f and max 

|c|i<l 



^ce{9ie-rje) 
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Let now U = [rji, G = [71, ...,%]. When substituting U and G for U and G into ([7| we come 

to the foUowing minmax problem: 



minmax < |(/ — n){7c|2 
n c ' 



n is a projectoron an m-dimensional 
subspace of 
c E M^, |c|i < 1, |Gc|2 < Q 



(10) 



Here we have substituted the constraint Gc = with the inequahty constraint \Gc\2 < Q for some 
^ > in order to keep the optimal solution c* to ([7]) feasible for the modified problem ( |10[ ) (this 
will be the case with probability at least 1 — e ii g > vn). 

As we will see in a moment, when c runs the z^Ar-neighborhood of intersection Cn of the standard 
hyperoctahedron {c G M^, |c|i < 1} with the subspace Gc = 0, vectors Uc span a close vicinity of 
the target space I. 



3.2 Solution by Semidefinite Relaxation 



Note that (10) is a hard optimization problem. Namely, the candidate maximizers q of (10) are 
the extreme points of the set Cn = {c G M^, |c|i < 1, \Gc\2 < i^n}, and there are 0{L'^) of such 
points. In order to be efficiently solvable, the problem ( |10[ ) is to be "reduced" to a convex-concave 
saddle-point problem, which is, to the best of our knowledge, the only class of minmax problems 
which can be solved efficiently (cf. [18j). 



Thus the next step is to transform the problem in (10) into a convex-concave minmax problem 



using the Semidefinite Relaxation (or SDP-relaxation) technique (see e.g., [5l Chapter 4]). We obtain 



the relaxed version of (10) in two steps. First, let us rewrite the objective function (recall that I — 11 
is also a projector, and thus an idempotent matrix): 

I (/ - U)Uc\l = c^U^{I - UfUc = JU^{I -n)Uc = trace \u^{I - U)UX 

where the positive semidefinite matrix X = c(F is the " new variable" . The constraints on c can be 
easily rewritten for X: 

1. the constraint |c|i < 1 is equivalent to \X\i < 1 (we use the notation = X^f^^^ l^^uDi 

2. because X is positive semidefinite, the constraint \ Gc\2 < is equivalent to into trace [GXG"^] < 

The only "bad" constraint on X is the rank constraint: rankX = 1, and we simply remove it. Now 
we are done with the variable c and we arrive at 



min max < trace 
n X 



U^[l - n)c/x 



n is a projector on an m-dimensional 

subspace of M'^ 
^ ^ 0, \X\i< 1, trace [GXG^] < 



Let us recall that an m-dimensional projector 11 is exactly a symmetric d x d matrix of rank 11 = m 
and trace n = m, with the eigenvalues < Aj(n) < 1, i = l,...,d. Once again we remove the 
"difficult" rank constraint rank 11 = m and finish with 



min max <, trace 
p X 



U^{I - P)UX 



^<P<I, trace P = m, 
^ ^ 0, |X|i < 1, trace [GXG^] < g"^ 



(11) 
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(we write P ^ Q i f the matrix Q — P \s positive semidefinite). There is no reason for an optimal 



solution P of (11) to be a projector matrix. If an estimation of 11* which is itself a projector is 
needed, one can use instead the projector 11 onto the subspace spanned by m principal eigenvectors 
of P. 



Note that ( 11 ) is a linear matrix game with bounded convex domains of its arguments - positive 
semidefinite matrices P £ R'^^'^ and X G M^^^. 

We are about to describe the accuracy of the estimation IT of 11*. To this end we need an 
identifiability assumption on the system {h^} of test functions as follows: 

Assumption 1 Suppose that there are vectors ci,...,Cfn, m < m < L such that \ck\i < 1 and 
Gck = 0, k = 1, ...,rn, and non-negative constants . . . , such that 

m 

U* ^^^i''UckclU^. (12) 



k=l 



We denote fi* = iJ.^ + . . . + /i™. 

In other words, if Assumption [T] holds, then the true projector 11* on T is fj,*x convex combination 
of rank-one matrices Ucc^U'^ where c satisfies the constraint Gc = and |c|i = 1. 

Theorem 1 Suppose that the true dimension m of the subspace I is known and that Q > vn cls in 
Q). Let P be an optimal solution to (11) and let 11 be the projector onto the subspace spanned by 



m principal eigenvectors of P. Then with probability > \ — e: 

(i) for any c such that \c\i < 1 and Gc = 0, 

(ii) further, if Assumption^holds then 



trace 



(/-p)n* <^f{{e + yN)\-^^m + 25N)^ (i3) 



ana 

||n-n*||i < 2/i*(A-[„(E)(£, + i/jv) + 25^)2 T, 

r = {m + l)^{l-^,*{\-;^^{T.){Q + VM) + 25N?)-^ 

1 /2 

(here \\A\\2 = w2,i j = (trace [A"^j4])^''^ is the Frobenius norm of A). 



(14) 



Note that if we were able to solve the minimax problem in (10), we could expect its solution, let us 
call it n, to satisfy with high probability 

|(/ - TL)Uc\2 <{q + i^iv)A-JjS) + 25n 

(cf. the proof of Lemma [T] in the appendix). If we compare this bound to that of the statement (i) 
of Theorem [l| we conclude that the loss of the accuracy resulting from the substitution of ( 10 ) by 
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its treatable approximation (11) is bounded with \/m + 1. In other words, the "price" of the SDP- 



relaxation in our case is y/m + 1 and does not depend on problem dimensions d and L. Furthermore, 
when Assumption 1 holds true, we are able to provide the bound on the accuracy of recovery of 
projector 11* which is seemingly as good as if we were using instead of 11 the solution 11 of (10). 

Suppose now that the test functions he{x) = f{x,u}i) are used, with coi on the unit sphere 
of R'^, that Q = UN is chosen, and that Assumption 1 holds with "not too large" e.g., /x* < 
^(g + ^Af)'^mL(^) + ^N- When substituting the bounds of ^ for 6]\f and uj^j- into (14) we obtain 
the bound for the accuracy of the estimation 11 (with probability 1 — e) : 

||n-n*||^ < C{f)n*N-^ (min(d,lnL) + lne*^) 

where C{f) depends only on /. This bound claims the root-A^ consistency in estimation of the 
non-Gaussian subspace with the log-price for relaxation and estimation error. 



3.3 Case of unknown dimension m 



The problem ( 11 ) may be modified to allow the treatment of the case when the dimension m of the 



target space is unknown a priori. Namely, consider for p> the following problem 



min < t 

P,t 



trace P < t, maxx trace 



U^{I-P)UX 



< ^ P ^ /, 



X hO, \X\i < 1, trace [GXG^] < 



(15) 



The problem (15) is closely related to the ^i-recovery estimator of sparse signals (see, e.g., the 



tutorial [8] and the references therein) and the trace minimization heuristics widely used in the Sparse 
Principal Component Analysis (SPCA) (cf. [2^, '3]). As we will see in an instant, when the parameter 
p of the problem is "properly chosen", the optimal solution P of (15) possesses essentially the same 



properties as that of the problem (11) 



A result analogous to that in Theorem [T] holds: 



Theorem 2 Let P, X and t = trace P he an optimal solution to (15) (note that (15) is clearly 
solvable), in =J?[j^ and let H be theprojector onto the subspace spanned by m principal eigenvectors 
of P. Suppose that g > as in and that 

P>X;^,^{^){g + UN) + 6N- 



(16) 



Then with probability at least 1 — e: 

t<m and \{I - Yi)Uc\2 < Vm + l(p + 26n); 
(ii) furthermore, if Assumption^ hold then 

trace \ {I - P)U*] <h*{p + 6n)^, 

and 

||n-n*||^ < 2p.*{p + 5n)^ [(m + l) fi*{p + 5N) 

(here ||A||2 = (Xli ^-^ Frobenius norm of A). 



2\-ll 



(17) 



^Here Ja[ is the smallest integer > a. 
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The proof of the theorems is postponed until the appendix. 



The estimation procedure based on solving (15) allows to infer the target subspace T without 



o 'priori knowledge of its dimension m. When the constraint parameter p is close to the right-hand 



side of (16), the accuracy of the estimation will be close to that, obtained in the situation when 



dimension m is known. However, the accuracy of the estimation heavily depends on the precision of 
the available (lower) bound for Amin(5^)- In the high-dimensional situation this information is hard 
to acquire, and the necessity to compute this quantity may be considered as a serious drawback of 
the proposed procedure. 



4 Solving the saddle-point problem (11) 



We start with the following simple observation: by using bisection or Newton search in p (note that 
the objective of (15) is obviously convex in p^) we can reduce (15) to a small sequence to feasibility 



problems, closely related to (11): given io report, if exists, P such that 



trace 



max 
X 



U^iI-P)UX 



< ^ P ^ /, trace P < to, 



XhO, \X\i< 1, trace [GXG^] < 



In other words, we can easily solve (15) if for a given m we are able to find an optimal solution to 



(11). Therefore, in the sequel we concentrate on the optimization technique for solving (11) 



4.1 Dual extrapolation algorithm 



In what follows we discuss the dual extrapolation algorithm of [22] for solving a version of (11) 



in which, with a certain abuse, we substitute the inequality constraint trace GXG < q with the 
equality constraint trace [GXCf^] = 0. This way we come to the problem: 



min max trace 



U^{I-P)UX 



(18) 



where 



X = {X eS^, X ^ 0, |X|i < 1, trace [G'^GX] = 0} 
(here stands for the space of L x L symmetric matrices) and 

V = {P eS'^, ^ P ^ I, trace [P] < m}. 



Observe first that ( 18 ) is a matrix game over two convex subsets (of the cone) of positive semidefinite 
matrices. If we use a large number of test functions, say ~ 10^, the size of the variable X rules out 
the possibility of using the interior-point methods. The methodology which appears to be adequate 
in this case is that behind dual extrapolation methods, recently introduced in \19 \ \20 \ [21 } I22j. The 
algorithm we use belongs to the family of subgradient descent- ascent methods for solving convex- 
concave games. Though the rate of convergence of such methods is slow — their precision is only 
0{l/k) , where k is the iteration count, their iteration is relatively cheap, what makes the methods 
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of this type appropriate in the case of high-dimensional problems when the high accuracy is not 
required. 

We start with the general dual extrapolation scheme of [22j for linear matrix games. Let and 
E"* be two Euclidean spaces of dimension n and m respectively, and let ^ C E" and B C E™ be 
closed and convex sets. We consider the problem 

mm mayi{x, Ay) + {a, x) + {b,y). (19) 

Let II • \\x and || • \\y be some norms on E" and E™- respectively. We say that dx (resp., dy) is a 
distance-generating function of A (resp., of B) if (resp., dy) is strongly convex modulus (resp., 
ay) and differentiable on A (resp., on Let for z = {x,y) d{z) = dx{x) + dy{y) (note that d is 
differentiable and strongly convex on ^ x ;B with respect to the norm, defined on ^ x ;S according 
to, e.g. Il^ll = ||x||x + ||y||j/)- We define the prox-function V oiAxB as follows: for zq = (xo,yo) and 
z = {x,y) in A X B we set 

V{zo, z) = d{z) - d{zo) - {Vd{zo),z - zo). (20) 
Next, for s = {sx, Sy) we define the prox-tranform T{zq, s) of s: 

T{zo, s) =^ arg min [(s, z - zq) - V{zo, z)]. (21) 



Let us denote F{z) = {—A^y — a, Ax + b) the vector field of descend-ascend directions of (19) at 
z = (x, y) and let z be the minimizer of d over Ax B. Given vectors Zk.,z'^ ^ Ax B and Sk G E* at 
the /c-th iteration, we define the update z^+i, z'^j^^ and s^+i according to 

Zk+i = T{z,Sk), 

Zk+i = T{zk+i,\kF{zk+i)), 

Sfc+i = Sk + XkF{z^^^), 

where Afc > is the current stepsize. Finally, the current approximate solution z^+i is defined with 

fc+i 



i=i 

The key element of the above construction is the choice of the distance- generaing function d in the 
definition of the prox-function. It should satisfy two requirements: 

• let D be the variation of V over A x B and let a be the parameter of strong convexity of V 
with respect to || • ||. The complexity of the algorithm is proportional to D/a, so this ratio 
should be as small as possible; 



one should be able to compute efficiently the solution to the auxiliary problem (21) which is 
to be solved twice at each iteration of the algorithm. 



^Recall that a (sub-)differentiable on J" function / is called strongly convex on T with respect to the norm || ■ || of 
modulus a if {f'{x) — f'{y), x — y) > a\\x — for all x,y £ T. 
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Note that the prox-transform preserve the additive structure of the distance-generating function. 



Thus, in order to compute the prox-transform on the feasible domain V x X oi (18) we need to 



compute its "P and X components" ~ the corresponding prox-transforms on V and cX. There are 



several evident choices of the prox- functions dp and dx of the domains V and ^ of (18) which 



satisfy the first requirement above and allow to attain the optimal value 0{V'mlndliiL) of the ratio 



D/a for the prox- function y of (18). However, for such distance-generating functions there is no 



known way to compute efficiently the X-component of the prox-transform T in (21) for the set X. 



This is why in order to admit an efficient solution the problem ( 18 ) is to be modified one more time 



4.2 Modified problem 

We act as follows: first we eliminate the linear equality constraint which, taken along with X ^ 0, says 
that X = ZQ with Z ^ and certain Q; assuming that the d rows of G are linearly independent, 
we can choose Q as an appropriate {L — d) x L matrix satisfying QQ^ = I (the orthogonal basis of 
the kernel of G). Note that from the constraints on X it follows that trace [X] < 1, whence 

trace [Q^ZQ] = trace [ZQQ^] = trace [Z] < 1. 

Thus, although there are additional constraints on Z as well, Z belongs to the standard spectahedron 

Z = {Z £ S^^'^, Z^O, trace [Z] < 1}. 

Now can rewrite our problem equivalently as follows: 

min max tvacelU'^ (I - P)U(Q'^ ZQ)]. (22) 

Pev zez,\QTzQ\i<i 

Let, further, 

w = {w e s^, \\w\\2 < 1}, and y = {Y e S^, |y|i < 1}. 



We claim that the problem (22) can be reduced to the saddle point problem 



min max J^trace [U^ {I - P)UY] + \tTace[W {Q^ ZQ - Y)]^ (23) 



F{P,W; Z,Y) 



provided that A is not too small. 

Now, "can be reduced to" means exactly the following: 

Proposition 1 Suppose that A > L\U\2, where \U\2 is the maximal Euclidean norm of columns of 



U. Let {P,W; Z,Y) be a feasible solution e-solution to (23), that is 

{P,W; Z,Y) G (V,W;Z,y), and F{P,W) - F{Z ,9) < e 

where 

F(P,W)= max F(P,W: Z,Y), F(Z,Y) = min F(P,W; Z,Y). 

(z,Y)eZxy > ' /) — V ) / (p,w)evxw 
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Then setting 



Z 



Z, ^ if\Q^ZQ\i<l, 

\Q^ ZQ\^^ Z otherwise, 



the pair (P, Z) is a feasible e-solution to (22). Specifically, we have {P, Z) ^VxZ with \Cf' ZQ\\ < 
1, and 



G{P)-G{Z)<e, 



where 



G{P) 



max iT&ce\U^{I -P)UQ^ZQ]- GiZ) 
min trace [U'^{I - P)UQ^ ZQ]. 



The proof of the proposition is given in the appendix |A.3 



Note that feasible domains of ( 23 ) admit evident distance-generating fmictions. We provide the 



detailed computation of the corresponding prox-transforms in the appendix A. 4 



5 Numerical Experiments 

In this section we compare the numerical performance of the presented approach, which we refer to 
as SNGCA(SDP) with other statistical methods of dimension reduction on the simulated data. 



5.1 Structural adaptation algorithm 

We start with some implementation details of the estimation procedure. We use the choice of the 
test functions hi{x) = f{x,uj() for the SNGCA algorithm as follows: 

f{x,uj) = tanh(u;^x)e~"ll^"2/2, 

where oji, I = 1, L are unit vectors in M'^. 

We implement here a multi-stage variant of the SNGCA (cf tlQj)- At the first stage of the 
SNGCA(SDP) algorithm we assume that the directions W£ are drawn randomly from the unit sphere 
of W^. At each of the following stages we use the current estimation of the target subspace to 
"improve" the choice of directions W£ as follows: we draw a fixed fraction of w's from the estimated 
subspace and draw randomly over the unit square the remaining oj's. The simulation results below 
are present for the estimation procedure with three stages. The size of the set of test function is set 



to L = 10 (i, and the target accuracy of solving the problem (11) is set to le — 4. 
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We can summarize the SNGCA(SDP) algorithm as follows: 
Algorithm 1: SNGCA (SDP) 
7o Initialization: 

The data (^i)^i are re-centered. Let a = (fii, . . . a^) be the standard deviations of the 
components of Xi . We denote Yi = diag(fT~^)Xj the standardized data. 
Set the current estimator Hq = Id- 

% Main iteration loop: 

for i=l to / do 

Sample a fraction of o;^*-' 's from the normal distribution A^(0,nj_i) (zero mean, with 
covariance matrix Ili-i), sample the remaining w^*^ 's from A^(0, I^), then normalize to the 
unit length; 

°/o Compute estimations of r/^ and 7^ 
for t=l to L do 

end 

Solve the corresponding problem ( |ll[ ) and update the estimation Ilj; 
end 



5.2 Experiment description 

Each simulated data set = [Xi, Xj^] of size = 1000 represents N i.i.d. realizations of a 
random vectors X of dimension d. Each simulation is repeated 100 times and we report the average 
over 100 simulations Frobenius norm of the error of estimation of the projection on the target space. 
In the examples below only m = 2 components of X are non-Gaussian with unit variance, other 
d — 2 components of X are independent standard normal r.v.. The densities of the non-Gaussian 
components are chosen as follows: 

(A) Gaussian mixture: 2-dimensional independent Gaussian mixtures with density of each com- 

ponent given by 0.5 (ps^i{x) + 0.5 (j)3^i{x). 

(B) Dependent super-Gaussian: 2-dimensional isotropic distribution with density proportional 

to exp(— llxll). 

(C) Dependent sub-Gaussian: 2-dimensional isotropic uniform with constant positive density 

for \\x\\2 < 1 and otherwise. 

(D) Dependent super- and sub-Gaussian: a component of X, say Xi, follows the Laplace dis- 
tribution £(1) and the other is a dependent uniform U{c,c+ 1), where c = for \Xi\ < In 2 
and c = — 1 otherwise. 
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(E) Dependent sub-Gaussian: 2-dimensional isotropic Cauchy distribution with density propor- 
tional to A(A^ — x'^)^^ where A = 1. 

We provide the 2-d plots of the densities of the non-Gaussian components on Figure [T| 




(A) (B) (C) 

(D) (E) 

Figure 1: (A) independent Gaussian mixtures, (B) isotropic super-Gaussian, (C) isotropic uniform 
and (D) dependent Id Laplacian with additive Id uniform, (E) isotropic sub-Gaussian 

We start with comparing the presented algorithm with Projection Pursuit (PP) method P^] and 
the NGCA for d = 10. The results are presented on Figure [2] (the corresponding results for PP and 
NGCA has been already reported in [TU] and [H]). 




14 



MixedGaussian 



Laplace 



SubGaussian 




PP(tanh) NGCA SNGCA 
LaplaceMix 



PP(tanh) NGCA SNGCA 
Cauchy 




0.04 
0.03 
0.02 
0.01 




PP(tanh) NGCA SNGCA 



PPftanh) NGCA SNGCA 



PP(tanh) NGCA SNGCA 



Figure 2: Comparison of PP, NGCA and SNGCA(SDP) 



Since the minimization procedure of PP tends to be trapped in a local minimum, in each of the 100 
simulations, the PP algorithm is restarted 10 times with random starting points. The best result is 
reported for each PP-simulation. We observe that SNGCA(SDP) outperforms NGCA and PP in all 
tests. 

In the next simulation we study the dependence of the accuracy of the SNGCA(SDP) on the noise 
level and compare it to the corresponding data for PP and NGCA. We present on Figure |3]the results 
of experiments when the non-Gaussian coordinates have unit variance, but the standard deviation 
of the components of the 8-dimensional Gaussian distribution follows the geometrical progression 
10-^ l0-'-+2'-/7, ...,10'- where r = 1, . . . , 8. 
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Figure 3: estimation error with respect to the standard deviation of Gaussian components foUowing 
a geometrical progression on [10^' , 10^'] where r is the parameter on the abscissa 

The conditioning of the covariance matrix heavily influences the estimation error of PP(tanh) and 
NGCA, but not that of SNGCA(SDP). The latter method appears to be insensitive to the differences 
in the noise variance along different direction in all test cases. 

Next we compare the behavior of SNGCA(SDP), PP and NGCA as the dimension of the Gaus- 
sian component increases. On Figure |4] we plot the mean error of estimation against the problem 
dimension d. 
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Figure 4: mean-square estimation error vs problem dimension d 

For PP and NGCA methods we observe that the estimation becomes meaningless (the estimation 
error explodes) already for d = 30 — 40 for the models (A), (C) and for d = 20 — 30 of the model (D). 
In the case of the models (B) and (E) we observe the progressive increase of the error for methods PP 
and NGCA. The proposed method SNGCA(SDP) behaves robustly with respect to the increasing 
dimension of the Gaussian component for all test models. 

5.3 Application to Geometric Analysis of Metastability 

Some biologically active molecules exhibit different large geometric structures at the scale much 
larger than the diameter of the atoms. If there are more than one such structures with the life span 
much larger that the time scale of the local atomic vibrations, the structure is called metastable 
conformation [27]. In other words, metastable conformations of biomolecules can be seen as connected 
subsets of state-space. When compared to the fluctuations within each conformation, the transitions 
between different conformations of a molecule are rare statistical events. Such multi-scale dynamic 
behavior of biomolecules stem from a decomposition of the free energy landscape into particulary 
deep wells each containing many local minima |231I12|. Such wells represent different almost invariant 
geometrical large scale structures [IJ. The macroscopic dynamics is assumed to be a Markov jump 
process, hopping between the metastable sets of the state space while the microscopic dynamics 
within these sets mixes on much shorter time scales |14| . Since the shape of the energy landscape 
and the invariant density of the Markov process are unknown, the "essential degrees of freedom" , in 
which the rare conformational changes occur, are of importance. 

We will now illustrate that SNGCA(SDP) is able to detect a multimodal component of the data 
density as a special case of non-Gaussian subspace in high-dimensional data obtained from molecular 
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dynamics simulation of oligopeptides. 



Clustering of 8-aIanine The first example is a times series, generated by an equilibrium molec- 
ular dynamics simulation of 8-alanine. We only consider the backbone dihedral angles in order to 
determine different conformations. 

The 14-dimensional time series consists of the cyclic data set of all backbone torsion angles. The 
simulation using CHARMM was done at T = 300K with implicit water by means of the solvent model 
ACE2 [26j. A symplectic Verlet integrator with integration step of Ifs was used; the total trajectory 
length was and every r = 50/s a set of coordinates was recorded. 

The dimension reduction reported in the next figure was obtained using SNGCA(SDP) with for 
a given dimension m = 5 of the target space containing the multimodal component. 

8ALA: SNGCA to 5D with entropic sampling and dip index 
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Figure 5: low dimensional multimodal component of 8-alanine 

A concentration of the clustered data in the target space of SNGCA may be clearly observed. In 
comparison, the complement of the target space is almost completely filled with Gaussian noise. 
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Figure 6: Gaussian noise in the complement of the SNGCA target space. 



Clustering of a 3-peptide molecule In the next example we investigate Phenylalanyl-Glycyl- 
Glycine Tripeptide, which is assumed to realize all of the most important folding mechanisms of 
polypeptides The simulation is done using GRDMACS at T = 300K with implicit water. An 

integration step of a symplectic Verlet integrator is set to 2fs, and every r = 50/s a set of 31 diedre 
angles was recorded. As in the previous experience, the dimension of the target space is set to m = 5 
Figureu^shows that the clustered data can be primarily found in the target space of SNGCA(SDP). 
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Speptid: SNGCA to 4D with entropic sampling and dip index 
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Figure 7: low dimensional multimodal component of 3-peptide 



6 Conclusions 

We have studied a new procedure of non-Gaussian component analysis. The suggested method, same 
as the techniques proposed in [6l [10], has two stages: on the first stage certain linear functionals 
of unknown distribution are computed, then this information is used to recover the non-Gaussian 
subspace. The novelty of the proposed approach resides in the new method of non-Gaussian subspace 
identification, based upon semidefinite relaxation. The new procedure allows to overcome the main 
drawbacks of the previous implementations of the NGCA and seems to improve significantly the 
accuracy of estimation. 

On the other hand, the proposed algorithm is computationally demanding. While the first-order 
optimization algorithm we propose allows to treat efficiently the problems which are far beyond 
the reach of classical SDP-optimization techniques, the numerical difficulty seems to be the main 
practical limitation of the proposed approach. 
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A Appendix 

Let X = G M^^^ be positive semidefinite with \X\i < 1, and let Y = X^l"^ be the symmetric 
positive semidefinite square root of X. If we denote yj, i = 1, the columns of y, then \X\\ < 1 
implies that 

i<i,j<L 

We make here one trivial though useful observation: for any matrix A G M.'^^^, when denoting 
we have 



\AY\\l = trace {A^ AX) = trace [BX] = V V B^Xij < max \Bij\ = \A\l. (24) 



(Recall that for a matrix j4 G M'^^^ with columns Oj, i = 1, L, |^|2 stands for the maximal column 
norm: \A\2 = maxi<j<L |ai|2). 

We can rewrite the problem (11) using Y = X^/^, so that the objective 

/(X, P) = trace [U^ {I - P)UX] 



of ( 11 ) becomes 

g{Y,P) = \\{I-PYl^UY\\l 
Let now {X, P) be a saddle point of (11 ). Namely, we have for any feasible P and X: 

f{X, P) < [I ^ f{X, P)] < f{X, P), 

We denote Y = X^l'^ . 

In what follows we suppose that vectors 7^ and rji, £ = 1, L satisfy ([s]). In other words,it holds 
\U-U\2<6n and \G - Gjs < jn- 

A.l Proof of Theorem [H 



Lemma 1 Let P be an optimal solution to (11). Then 



max{|(/-P)i/2[/c|2 I |c|i < 1, Gc = 0} < A-J„(S)(£. + i/7v) + 2<5jv. (25) 
Proof. We write: 

max||(I-P)i/2f;g|2| 1^1^ < 1^ Gc = o} 

< max|||(/-P)^/2;7y||2 1 |y2|i < 1, Gy = 0} 

< max|||(/-P)^/2[;y||2| |y2|^ < 1, Gy = 0} 

+ max|||(/-P)i/2(C/-?7)y||2| \Y\ < 1, GY = 0^ 
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< max{||(/-P)i/2^/y||2| |y2|, < i, ||Gy||2 < ^} 
(by m) +\{I-Pf/\U-U)\2 



(duetoO^/-P^/) = -P)^/^UY\\2 + 5n <\\{I -^*Y'^UY\\2 + 5n 
(again by m) < \\{I-Il*f'^UY\\2 + 25N. 



On the other hand, as ||Gy||2 < z^at, we get 

||Gy||2 < ||Gy||2 + ||(G-G)y||2 <q+\g-g\2 < q + un, 

and by ^, 

\\{I-U*)UY\\2<X^,{^){g + m). 



This imphes (25). 



□ 



We now come back to the proof of the theorem. Let Xj and dj, j = 1,.. . ,d be respectively the 
eigenvalues and the eigenvectors of P. Assume that Ai > A2 > . . . > A^. Then P = Yl'j=i ^j^j^ 
and n = Yl^=i ■ P = Uc for c such that |c|i < 1 and Gc = 0. We have 



/3^(/-P)/3 = ^(l-A,)(^/3)2 + ^(l-A,)(^/3)2 



i=i 



> 5^(l-A,)(^/3)2>(l-W)(^/3)2 



j>m 



(1 - A^+i)/3^(/ - n)/3 = (1 - A™+i)|(/ - n) 



Since, for obvious reasons, A^+i < applies (i) due to (25). 



Let us show (ii). We have due to (12) and (25): 



trace ^(/ - P)n* = trace (/ - P)^/2n*(I - 

m 

< X^/^fctrace [{I - P^/'UcclU^ {I - P)'/'] = - P)'/'Uc,\l 

k=l k 
m 

< ^//fcmax||(I-P)^/2f7g|2| 1^1^ < Gc = o} 
k=i 



k=l 



(26) 



which is (13). 



Note that trace [PIi*\ < T.j<m (cf, e.g.. Corollary 4.3.18 of [15J), thus by (|26|) 



Xm+i < m - ^ A, < trace [(/ - P)U*] < ^*(A-JjS)(^> + un) + 26n)' 

j<m 
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On the other hand, 



trace [(/ - P)U*] = ^(1 - Xj)ejn*9j + X] " 



j>m 



> ^ (1 - Xj)efu% > (1 - Xm+i)eju*'' 



(1 - A„+i)trace [(/ - n)n* 



and we conclude that 



trace [(/-n)n*] < P)^*] < /^*(A- + + 2c^.)^ 



1-A^+i l-fi*{Xj^{^){0 + iyN) + 26Nr 

Now, usmg the relation trace 11 = trace 11* = m, we come to 

||n - U*\\l = trace [ft^ - 2nn* + {U*f] = 2m - 2trace [fin*] = 2trace [(/ - fi)n*] 



and we arrive at (14). 



A. 2 Proof of Theorem I 



Let now P, X and t = trace P be a triplet of optimal solution to (15) 



Lemma 2 Let P he an optimal solution to (15) 



(i) In the premises of the theorem IT* is a feasible solution of (15) and trace P < trace 11* = m. 
(a) We have 

uia-K^\{I -Pf^Uc\2 I |c|i < 1, Gc = o} <p + 6n- 
Proof. We act as in the proof of Lemma [T] to verify (i) we observe that 



(27) 



max I trace [U^ {I - n*)UX] 
-Tl*)UY\\l 



max ■ 
Y 



^ ^ 0, \X\i< 1, trace [GXG^] < } 
\Y\ < 1, ||Gy||2 < q] 



< max|(||(/-n*)C/y||2 + (57v)^ 



\Y\ < I, \\GY\\2 < g 



Thus, if p > Xj^-,^{Il){g+i']\i)+6N, LE* is a feasible solution of ( 15 ) and, as a result, trace P < trace 11* 
To show (a) it suffices to note that 



max 

c 



{|(/-P)i/2c/c|2| |c|i < 1, Gc = o} 

< max I II (/- p)i/2^y 1I2 1 |y^|i < 1, Gy = o| 

< max|||(/-P)i/2?7y||2 I \Y^\i<l, GY = 0^ + \{I-P)^^^{U-U)l^ 



< max I II (/- P)i/2f7y 



|>^'|i < 1, ||Gy||2 < Q 



< P + Sn 
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because of the feasibility of P. □ 
Now using the bound m < m we complete the proof following exactly the lines of the proof of 
Theorem [H 

A. 3 Proof of Proposition [l] 

Observe that 

F(Z,Y) = min \tvace[B^(I - P)BQ'^ZQ] + XtTace[W(Q^ZQ -Y)]\ 

{P,W)'=:VXW ^ ) 

= min jtrace [B^{I - P)BQ^ZQ] - X \\Q^ZQ - yjla} 

< min (trace [B^(I - P)BQ^ZQ] } = GiZ); (28) 

and 

F{P,W) = ^^max ^ jtrace [^'^(I - P)SQ'^Zg] + A trace [VF(Q^ZQ - y)]} 

> max (trace - P)BQ'^ZQ] + A trace [WiQ'^ZQ - Y)] } 

= GiP): 

Assume first that \Q'^ZQ\i < 1. In this case Z = Z and 

e > F{P, W) - F{Z, Y) = G{P) - G{Z) = G{P) - G{Z) 



(the second ^ is given by (28)), as claimed. Now assume that s = Q ZQ\\ > 1. We have already 



established the first equality of the following chain: 

F(Z, Y) = min jtrace {B^{I - P)BQ^ZQ) - A \\Q^ZQ - Y\\2^ 

< min jtrace [5^(1 - P)BQ^ZQ] - ^\Q^ZQ - Y\i 

< min jtrace [B^{I - P)BQ^ZQ] - ^(s - 1] 
= min jstrace [B^{I - P)BQ^ZQ] - ^{s - I] 



< min < 

Per 



trace [B^{I - P)BQ^ ZQ] + {s - 1)|5^(I - P)B\^\Q^ ZQ\i -^(s - 1) 



<(s~l)\BT B\^={s-l)\B\l 

< min jtrace [B^{I - P)BQ^ZQ]^ = G{Z), 

where the concluding < is readily given by the definition of AJ^ Further, we have already seen that 

F{P,W) > G{P). 



^We denote |A|oo = max^j \Aij\ 
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Consequently, 
as claimed. 



e > F{P, W) - F{Z, Y) > G{P) - G{Z), 



□ 



A. 4 Computing the prox-transform 

Recall that because of the additivity of the distance-generating function d the computation of the 
prox-transform on the set V xW x Z xy can be decomposed into independent computations on the 



four domains of (23). 



Prox-transform on V. The proxy-function of V is the matrix entropy: 



d{Po,P) = /3ptrace 



- - 

m \ \m 



In 



m 



for P,Po eV, pp> 0. 



To compute the corresponding component of T we need to find, given S E S'^, 



Tg(Poi S) = arg max < trace [SiP — Pq)] — /3p trace 
PeV [ 

S+^ln(^]]P 
m \ m 



- f In f - 1 - In 
m \ \m 



Po 



m 



(29) 



arg max < trace 
Pev ' 



Pp trace 



-'"(-) 

m \m I 



By the symmetry considerations we conclude that the optimal solution of this problem is diagonal 
in the basis of eigenvectors of S* -|- ^ In(^). Thus the solution of (29) can be obtained as follows: 
compute the eigenvalue decomposition 

5 + ^in(^) = rAr^ 

m m 

and let A be the diagonal of A. Then solve the "vector" problem 

d 



* \T Pp 
p = arg max A p 

o<p<i,Ep<"i 



y^Pilnjpi/m). 



(30) 



i=l 



and compose 



T^(P,5) = rdiag(y*)r^ 



Now, the solution of ( |30[ ) can be obtained by simple bisection: indeed, using Lagrange duality 

I A 1, i = I, d, 



we conclude that the components of y* satisfies 

'A. 



Pi = exp 



/3 



and the Lagrange multiplier v is to be set to obtain "^p* = m, what can be done by bisection in v. 



When the solution is obtained, the optimal value of (29) can be easily computed. 
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Prox-transform on W. The distance-generating function of W is /3vi^trace [W^^]/2 = ||M^||2/2 so 
that we have to solve for S £ 

Tp{Wo,S) = arg max (trace [S{W - Wo)] - ^ ~ 1 . (31) 
||VK||2<1 t 2 Z J 



The optimal solution to (31) can be easily computed 

r (w ^) = l ^0 + "^/^^ 11^0 + ^ ^' 

/3l>'^'0,^J \ (VFo + 5//3t^)/||VFo + 5//3H^||2 if ||m) + S//3iy||2 > 1. 

Prox-transform on Z. The prox- function of Z is the matrix entropy and we have to solve for 
S G S^-^ 

Tji{Z, S) = argmaxtrace [S{Z — Zq)] — /3^trace [Z{ln{Z) — In(Zo))] 
= argmaxtrace [(S" + ln(Zo))Z] — /3^trace [Z In Z] . 

Once again, in the basis of eigenvectors of S + (3z In(Zo) the problem reduces to 

d 

z* = arg max X^z — Pz / ^ Zi In(zj), 

2>o, E2<i , 

where A is the diagonal of A with S + Pz In Zq = FAF-^. In this case 

exp(^) 



Ef=iexp(^; 



1 = 1-1 .■.,L — d. 



^3-- 

Prox-transform on y. The distance generating function for the domain y is defined as follows: 

{L n 
{uij Muij] + Vij ln[-Uij] ) : + = 1, 

i,j=l i=l 

Yij = Uij - Vij, Uij > 0, Vij > 0, 1 <i,j < L} . 

In other words, the element y G 3^ is decomposed according to y = u — t;, where (n, v) is an element 
of the 2L^-dimensional simplex A = To find the y-component of the 

prox-transform amounts to find for S £ 



arg max trace [Siu — v)] — /3y 7 



Uij\n{-^) + Vij\Ti{^) 



(32) 



One can easily obtain an explicit solution to (32): let 



aij = Uij exp ( — j , hij = Vij exp — 



28 



Then Tp^{Yo, S)=u* -v*, where 

* ^ ajj * ^ bjj 
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